Abstract: A point-to-point wireless communication system in
the presence of an energy harvesting device and a rechargeable 
battery at the transmitter is considered. Both the energy and the 
data arrivals at the transmitter are modeled as Markov processes. 
Delay-limited communication is considered assuming that the 
underlying channel is block fading with memory and the channel 
state information is available at the transmitter. 

The problem of maximizing the expected total data trans- 
mitted during the transmitter's lifetime is studied under three 
different sets of assumptions regarding the information about 
the underlying stochastic processes, available at the transmitter. 
A learning theoretic approach is introduced, which does not 
consider any a priori information on the Markov processes 
governing the communication system. In addition, online and 
offline optimization problems are studied for the same setup 
assuming full statistical knowledge and causal information on 
the realizations, and non-causal knowledge in the realizations 
of the stochastic processes, respectively. Comparing the optimal
solutions in all three frameworks, the performance loss due to
the transmitter's lack of information about the behavior of
the underlying Markov processes is identified. Numerical results
are presented to corroborate our theoretical findings.

Index Terms — Dynamic programming, Energy harvesting, Ma- 
chine learning, Markov processes, Optimal scheduling, Wireless 
communication 



I. Introduction 

Energy harvesting (EH) has emerged as a promising technology
to extend the lifetime of communication networks, such
as machine-to-machine or wireless sensor networks, complementing
current battery-powered transceivers by harvesting
available ambient energy (solar, vibration, thermo-gradient,
etc.). As opposed to battery limited devices, an EH transmitter
can theoretically operate over an unlimited time horizon;
however, in practice the transmitter's lifetime is limited by other
factors and typically the harvested energy rates are quite low.
Hence, in order to optimize the communication performance,
with sporadic arrival of energy in limited amounts, it is
critical to optimize the transmission policy using the available
information regarding the energy and data arrival processes.

There has been a growing interest in the optimization of
EH communication systems. Prior research can be grouped
into two categories, based on the information (about the energy and
data arrival processes) assumed to be available at the transmitter.
In the offline optimization framework, it is assumed
that the transmitter has non-causal information on the exact
data/energy arrival instants and amounts [1]-[9]. In the online
optimization framework, the transmitter is assumed to know
the statistics of the underlying EH and data arrival processes,
and has causal information about their realizations [10]-[16].

Nonetheless, in many practical scenarios either the charac- 
teristics of the EH and data arrival processes change over time, 
or it is not possible to have reliable statistical information 
about these processes before deploying the transmitters. For 
example, in a sensor network with solar EH nodes distributed 
randomly over a forest, each node's solar EH characteristic 
will depend on its location, and will change based on the time 
of the day or the season. Moreover, non-causal information 
about the data/energy arrival instants and amounts is too opti- 
mistic in practice, unless the underlying EH process is highly 
deterministic. Hence, neither online nor offline optimization 
frameworks will be satisfactory in most practical scenarios. To 
adapt the transmission scheme to the unknown EH and data 
arrival processes, we propose a learning theoretic approach. 

We consider a point-to-point wireless communication sys- 
tem in which the transmitter is equipped with an EH device 
and a finite-capacity rechargeable battery. Data and energy 
arrive at the transmitter in packets in a time-slotted fashion. 
At the beginning of each time-slot (TS), a data packet arrives 
and it is lost if not transmitted within the following TS. 
This can be either due to the strict delay requirement of the 
underlying application, or due to the lack of a data buffer at the 
transmitter. On the other hand, harvested energy can be stored 
in a finite size battery/capacitor for future use. We assume that 
the wireless channel between the transmitter and the receiver 
is constant for the duration of a TS but may vary from one 
TS to the next. We model the data and energy packet arrivals 
as well as the channel state as Markov processes. The lifetime 
of an EH transmitter is not limited by the available energy; 
however, to be more realistic we assume that the transmitter 
might terminate its operation (due to physical limitations, such 
as failure of one of its components, blockage of its channel to 
the receiver or it might be forced to switch to the idle mode by 
the network controller) at any TS with certain probability. The 
objective of the transmitter is to maximize the average amount 
of transmitted data to the destination during its lifetime under 
the packet deadline and battery constraints. 

For this setup, we study both offline and online optimization
problems. The solution of the offline optimization problem
constitutes an upper bound on the online optimization, and the
difference between the two indicates the value of knowing the
system behavior non-causally. Furthermore, we take a more
practically relevant approach, and assume that the statistical
information about the underlying Markov processes is not
available at the transmitter, and that all the data and energy
arrivals as well as the channel states are known causally. Under
these assumptions, we propose a machine learning algorithm
for the transmitter operation, such that the transmitter learns
the optimal transmission policy over time by performing
actions and observing their immediate rewards, and show
that its performance converges to the solution of the online
optimization problem as learning time increases. The main
technical contributions of the paper are summarized as follows:

• We provide, to the best of our knowledge, the first 
learning theoretic optimization approach to the EH com- 
munication system optimization problem under stochastic 
data and energy arrivals. 

• For the same system model, we provide a complete 
analysis by finding the optimal transmission policy for 
both the online and the offline optimization approaches 
in addition to the learning theoretic approach. 

• We show that the proposed Q-learning algorithm con- 
verges to the optimal transmission policy corresponding 
to the online optimization approach. 

• We provide a number of numerical results to corrobo- 
rate our findings, and compare the performance of the 
learning optimization approach with the offline and online 
optimization solutions. 

The rest of this paper is organized as follows. Section II
is dedicated to a summary of the related literature. In Section III
we present the EH communication system model.
In Section IV we study the online optimization problem and
characterize the optimal transmission policy through dynamic
programming (DP) [17]. In Section V we propose a learning
theoretic approach, and show that the transmitter is able to
learn the system stochastic dynamics and converge to the
optimal transmission policy. The offline optimization problem
is studied in Section VI. Finally, in Section VII the three
approaches are compared and contrasted in different settings.
Section VIII concludes the paper.

II. Related Work 

There is a growing literature on the optimization of
EH communication systems within both the online and offline
frameworks. Optimal offline transmission strategies have been
characterized for point-to-point systems with both data and
energy arrivals in [1], with battery imperfections in [2], and
with processing energy cost in [3]; for various multi-user
scenarios in [2], [4]-[7]; and for fading channels in [8]. Offline
optimization of precoding strategies for the MIMO channel
is studied in [9]. In the online framework the system is
modeled as a Markov decision process (MDP) and DP based
solutions are provided. In [10], the authors assume that the
packets arrive as a Poisson process, and each packet has
an intrinsic value assigned to it, which is also a random
variable. Modeling the battery state as a Markov process, the
authors study the optimal transmission policy that maximizes
the average value of the received packets at the destination.
Under a similar Markov model as [10], [15] studies the
properties of the optimal transmission policy. In [11], the
minimum transmission error problem is addressed, where the
data and energy arrivals are modeled as Bernoulli and Markov
processes, respectively. Ozel et al. [8] propose online as well as
offline approaches to the transmit data maximization problem
with stochastic energy arrivals and a fading channel. The causal
information assumption is relaxed by modeling the system as a
partially observable Markov decision process in [12] and [14].
Assuming that the data and energy arrival rates are known at
the transmitter, tools from queueing theory are used for long-term
average rate optimization in [16] and [13] for point-to-point
and multi-hop scenarios, respectively.



Similar to the present paper, references [18]-[21] optimize
EH communication systems under mild assumptions regarding
the statistical information available at the transmitter. In [18]
a forecast method for a periodic energy harvesting process
is considered. Reference [19] uses historical data to forecast
energy arrival and solves a duty cycle optimization problem
based on the expected energy arrival profile. Similarly to [19],
the transmitter duty cycle is optimized in [20] and [21] by taking
advantage of techniques from control theory and machine
learning, respectively. However, [19]-[21] consider only the
issue of balancing harvested and consumed energy regardless
of the underlying data arrival process and the cost associated
with the data transmission. In contrast, in our problem setup we
consider the data arrival and channel state processes along with
the energy harvesting process, significantly complicating the
problem at hand.

III. System Model 

We consider a wireless transmitter equipped with an EH 
device and a rechargeable battery with limited storage capacity. 
The communication system operates in a time-slotted fashion 
over TSs of equal duration. We assume that both data and 
energy arrive in packets at each TS. The channel remains 
constant during each TS while its state changes from one 
TS to the next. We consider strict delay constraints for the 
transmission of data packets; that is, each data packet needs to 
be transmitted within the TS following its arrival. We assume 
that the transmitter has a certain small probability $(1 - \gamma)$ of
terminating its operation at each TS and it is interested in 
maximizing the sum of transmitted data during its lifetime. 

The sizes of the data/energy packets arriving at the beginning
of each TS are modeled as correlated time processes
following a first-order discrete-time Markov model. Let $D_n$
be the size of the data packet arriving at TS $n$, where
$D_n \in \mathcal{D} = \{d_1, \ldots, d_{N_D}\}$ and $N_D$ is the number of elements
in $\mathcal{D}$. Let $p_d(d_j, d_k)$ be the probability of the data packet
size process going from state $d_j$ to state $d_k$ in one TS. Each
energy packet is assumed to be an integer multiple of a
fundamental energy unit. Let $E_n^H$ denote the amount of energy
harvested during TS $n$, where $E_n^H \in \mathcal{E} = \{e_1, \ldots, e_{N_E}\}$,
and $p_e(e_j, e_k)$ is the state transition probability function. The
energy harvested during TS $n$, $E_n^H$, is stored in the battery
and can be used for data transmission at the beginning of
TS $n + 1$. The battery has a limited size of $B_{max}$ energy
units and all the energy harvested when the battery is full
is lost. Let $H_n$ be the channel state during TS $n$, where
$H_n \in \mathcal{H} = \{h_1, \ldots, h_{N_H}\}$. We assume that $H_n$ also follows a
Markov model and $p_h(h_j, h_k)$ is the state transition probability.
Other works in the EH literature have considered similar
models for the energy harvesting [11], [12], [14] and data
arrival processes [12] as well as the channel state process [14],
[22]. Moreover, [10] also considers the case of a strict deadline
constraint and lack of data buffer at the transmitter.

For each channel state $H_n$ and packet size $D_n$, the transmitter
knows the minimum amount of energy $E_n^T$ required
to transmit the arriving data packet to the destination. Let
$E_n^T = f_e(D_n, H_n) : \mathcal{D} \times \mathcal{H} \to \mathcal{E}_u$, where $\mathcal{E}_u$ is a discrete set of
integer multiples of the fundamental energy unit. We assume
that if the transmitter spends $E_n^T$ energy units for transmission,
the packet is transmitted successfully.

In each TS $n$ the transmitter knows the battery state $B_n$, the
size of the arriving packet $D_n$, and the current channel state $H_n$,
and hence the amount of energy needed to transmit this
packet. At the beginning of each TS, the transmitter makes a
binary decision: to transmit or to drop the incoming packet.
Moreover, the transmitter must guarantee that the energy spent
in TS $n$ is not greater than the energy available in the
battery, $B_n$. Let $X_n \in \{0, 1\}$ be the indicator function of the
event that the packet $D_n$ is transmitted in TS $n$. Then, for
$\forall n \in \mathbb{Z}$, we have

$$X_n E_n^T \le B_n, \tag{1}$$

$$B_{n+1} = \min\{B_n - X_n E_n^T + E_n^H, \; B_{max}\}. \tag{2}$$
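
The two constraints can be sketched as a pair of small Python helpers. This is only an illustration of (1) and (2); the function and argument names are ours and all quantities are assumed to be integers expressed in fundamental energy units.

```python
def can_transmit(battery, energy_needed):
    """Constraint (1): transmitting is allowed only if the battery covers the cost."""
    return energy_needed <= battery

def next_battery(battery, transmit, energy_needed, harvested, b_max):
    """Battery recursion (2): spend, then store the harvest, clipped at b_max.

    transmit is the binary decision X_n (0 = drop, 1 = transmit).
    """
    if transmit and not can_transmit(battery, energy_needed):
        raise ValueError("action violates the energy constraint (1)")
    return min(battery - transmit * energy_needed + harvested, b_max)
```

For instance, with a battery of 3 units, a packet costing 2 units, a harvest of 1 unit and a capacity of 5 units, transmitting leaves 2 units for the next TS, while any harvest that would exceed the capacity is lost.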



Our goal is to maximize the expected sum of the transmitted
data over the lifetime of the transmitter:

$$\max_{\{X_n\}_{n=0}^{\infty}} \; \lim_{N \to \infty} \mathbb{E}\left[\sum_{n=0}^{N} \gamma^n X_n D_n\right] \tag{3}$$

s.t. (1) and (2),

where $0 < 1 - \gamma \le 1$ is the probability that the transmitter
terminates its operation in each TS, independently and identically
across TSs. We call this problem the discounted sum data
problem, as the term $\gamma$ is known as the discount factor in the
literature. The energy harvesting communication system that
is considered here is depicted in Figure 1. The case in which
$\gamma = 1$, that is, the transmitter can continue its operation as
long as there is available energy provided by the harvester,
is described for completeness. In this case, contrary to the
discounted sum data problem, (3) is not a practical measure
of performance as the transmitter operates for an infinite
amount of time; and hence, most transmission policies that
allow a certain non-zero probability of transmission at each
TS are optimal in the discounted sum data criterion as they
all transmit an infinite amount of data. Hence, in this case we focus on
the problem of maximizing the throughput:



$$\max_{\{X_n\}_{n=0}^{\infty}} \; \lim_{N \to \infty} \frac{1}{N+1} \mathbb{E}\left[\sum_{n=0}^{N} X_n D_n\right] \tag{4}$$

s.t. (1) and (2).

We call this problem the throughput optimization problem. The
main focus of the paper is on the discounted sum data problem;
therefore, we assume $0 \le \gamma < 1$ in the rest of the paper unless
otherwise stated. The throughput optimization problem is examined
only through numerical analysis in Section VII.

Figure 1. Energy harvesting communication system with energy harvesting
and data arrival stochastic processes as well as varying channel.
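
The link between the termination probability and the discount factor in (3) can be made explicit with a short derivation. This is a sketch under two assumptions that we state for illustration: the transmitter is operating in TS 0, and its survival from one TS to the next is independent of the data, energy and channel processes. The indicator $A_n$ is a symbol introduced only for this derivation.

```latex
% A_n = 1 if the transmitter is still operating in TS n, and 0 otherwise.
% It survives each TS independently with probability \gamma, so
\Pr(A_n = 1) = \gamma^n .
% The expected total data delivered during the (random) lifetime is then
\mathbb{E}\left[\sum_{n=0}^{\infty} A_n X_n D_n\right]
  = \sum_{n=0}^{\infty} \Pr(A_n = 1)\, \mathbb{E}\left[X_n D_n\right]
  = \sum_{n=0}^{\infty} \gamma^n\, \mathbb{E}\left[X_n D_n\right],
% which is the objective in (3). The expected number of operating TSs is
\sum_{n=0}^{\infty} \gamma^n = \frac{1}{1-\gamma},
% for example 10 TSs when \gamma = 0.9.
```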



An MDP provides a mathematical framework for modeling
decision-making situations where outcomes are partly random
and partly under the control of the decision maker [23]. The
EH communication system, as described above, constitutes a
finite-state discrete-time MDP. An MDP is defined via the
quadruplet $\langle \mathcal{S}, \mathcal{A}, p_{x_i}(s_j, s_k), R_{x_i}(s_j, s_k) \rangle$, where $\mathcal{S}$ is the set
of possible states, $\mathcal{A}$ is the set of actions, $p_{x_i}(s_j, s_k)$ denotes
the transition probability from state $s_j$ to state $s_k$ when action
$x_i$ is taken, and $R_{x_i}(s_j, s_k)$ is the immediate reward yielded
when in state $s_j$ action $x_i$ is taken and the state changes to
$s_k$. In our model the state of the system in TS $n$ is $S_n$, which
is formed by four components, $S_n = (E_n^H, D_n, H_n, B_n)$.
Since all components of $S_n$ are discrete, there exists a finite
number of possible states and the set of states is denoted
by $\mathcal{S} = \{s_1, \ldots, s_{N_S}\}$. The set of actions is $\mathcal{A} = \{0, 1\}$,
where actions 0 and 1 indicate that the packet is dropped or
transmitted, respectively. If the immediate reward yielded by
action $x_i \in \mathcal{A}$ when the state changes from $S_n$ to $S_{n+1}$ in
TS $n$ is $R_{x_i}(S_n, S_{n+1})$, the objective of an MDP is to find
the optimal transmission policy $\pi(\cdot) : \mathcal{S} \to \mathcal{A}$ that maximizes
the expected discounted sum reward. We restrict our attention
to deterministic stationary transmission policies. In our EH
communication problem, the immediate reward function is
$R_{X_n}(S_n, S_{n+1}) = X_n D_n$, and the expected discounted sum
reward is equivalent to (3), where $\gamma$ corresponds to the
discount factor and $X_n = \pi(S_n)$ is the action taken by the
transmitter when the system is in state $S_n$. The interaction
between the transmitter and the system forming an MDP is
illustrated in Figure 2.
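
As an illustration of how the state space and the transition probabilities of this MDP could be assembled, the Python sketch below enumerates the states $(E_n^H, D_n, H_n, B_n)$ and evaluates $p_{x_i}(s_j, s_k)$. It rests on assumptions that are ours: the energy, data and channel processes are taken to be mutually independent, the cost function `energy_cost` and all numerical values are hypothetical, and an action that violates (1) is simply assigned zero probability.

```python
from itertools import product

def build_states(energies, data_sizes, channels, b_max):
    """Enumerate every state s = (e, d, h, b) with battery level b in 0..b_max."""
    return list(product(energies, data_sizes, channels, range(b_max + 1)))

def transition_prob(s, x, s_next, p_e, p_d, p_h, energy_cost, b_max):
    """Probability of moving from s to s_next under action x (0 = drop, 1 = transmit)."""
    e, d, h, b = s
    e2, d2, h2, b2 = s_next
    cost = energy_cost(d, h)
    if x == 1 and cost > b:
        return 0.0  # infeasible action: constraint (1) is violated
    if b2 != min(b - x * cost + e, b_max):
        return 0.0  # the next battery level is fixed by recursion (2)
    # Assumed independence of the three exogenous Markov processes.
    return p_e[e][e2] * p_d[d][d2] * p_h[h][h2]

def reward(s, x):
    """Immediate reward X_n * D_n: the size of the packet if it is transmitted."""
    return x * s[1]

def energy_cost(d, h):
    """Hypothetical cost f_e(d, h): larger packets and worse channels cost more."""
    return d + h

# Hypothetical transition probabilities, stored as nested dictionaries.
p_e = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}
p_d = {1: {1: 0.5, 2: 0.5}, 2: {1: 0.4, 2: 0.6}}
p_h = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}
B_MAX = 2

states = build_states([0, 1], [1, 2], [0, 1], B_MAX)
# For a feasible action the probabilities over all next states sum to one.
total = sum(
    transition_prob((1, 1, 0, 2), 1, s2, p_e, p_d, p_h, energy_cost, B_MAX)
    for s2 in states
)
```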

Given the policy $\pi$ and the current state $S_n$, the state of the
battery $B_{n+1}$ is uniquely determined by (2). The other state
components are randomly determined using the state transition
probability functions. Since state transitions depend only on
the current state and the transmitter's current action, the
model under consideration fulfills the Markov property. As a
consequence, we can take advantage of DP and reinforcement
learning (RL) [24] tools to solve the optimization problem in
(3).

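Next, we introduce the state-value function and action-value
function which will be instrumental in solving the MDP [24].
The state-value function is defined as follows:

$$V^{\pi}(s_j) \triangleq \sum_{\forall s_k \in \mathcal{S}} p_{\pi(s_j)}(s_j, s_k) \left[ R_{\pi(s_j)}(s_j, s_k) + \gamma V^{\pi}(s_k) \right]. \tag{5}$$

It is, intuitively, the expected discounted sum reward of policy
$\pi$ when the system is in state $s_j$. The action-value function,
defined as

$$Q^{\pi}(s_j, x_i) \triangleq \sum_{\forall s_k \in \mathcal{S}} p_{x_i}(s_j, s_k) \left[ R_{x_i}(s_j, s_k) + \gamma V^{\pi}(s_k) \right], \tag{6}$$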

is the expected discounted reward when the system is in state
$s_j$, takes action $x_i \in \mathcal{A}$, and follows policy $\pi$ thereafter. A
policy $\pi$ is said to be better than or equal to policy $\pi'$, denoted
by $\pi \ge \pi'$, if its expected discounted reward is higher or
equal in all states, i.e., $\pi \ge \pi'$ if $V^{\pi}(s_j) \ge V^{\pi'}(s_j)$, $\forall s_j \in \mathcal{S}$.
The optimal policy $\pi^*$ is the policy that is better than or
equal to any other policy. Eqn. (5) indicates that the state-value
function $V^{\pi}(S_n)$ can be expressed as a combination of
the expected immediate reward and the state-value function of
the next state, $V^{\pi}(S_{n+1})$. The same happens with the action-value
function. The state-value function when the transmitter
follows the optimal policy is



$$V^{\pi^*}(s_j) = \max_{x_i \in \mathcal{A}} Q^{\pi^*}(s_j, x_i). \tag{7}$$



From (7) we see that the optimal policy is the greedy policy;
that is, the policy that performs the action with the highest expected
discounted reward according to $Q^{\pi^*}(s_j, x_i)$. The action-value
function, when the optimal policy is followed, is

$$Q^{\pi^*}(s_j, x_i) = \sum_{\forall s_k \in \mathcal{S}} p_{x_i}(s_j, s_k) \left[ R_{x_i}(s_j, s_k) + \gamma \max_{x_j \in \mathcal{A}} Q^{\pi^*}(s_k, x_j) \right]. \tag{8}$$

Similarly to (5), (8) indicates that the action-value function
$Q^{\pi^*}(S_n, x_i)$, when following $\pi^*$, can be expressed as a
combination of the expected immediate reward and the maximum
value of the action-value function of the next state.

There are three approaches to solve the optimization problem
in (3) depending on the available information at the transmitter.
If the transmitter has prior information on the values of
$p_{x_i}(s_j, s_k)$ and $R_{x_i}(s_j, s_k)$, the problem falls into the online
optimization framework, and we can use DP to find the optimal
transmission policy $\pi^*$. If the transmitter does not have prior
information on the values of $p_{x_i}(s_j, s_k)$ or $R_{x_i}(s_j, s_k)$ we can
use a learning theoretic approach based on RL. By performing
actions and observing their rewards, RL tries to arrive at an
optimal policy $\pi^*$ which maximizes the expected discounted
sum reward accumulated over time. Alternatively, in the offline
optimization framework, it is assumed that all future EH states
$E_n^H$, packet sizes $D_n$ and channel states $H_n$ are known non-causally
over a finite horizon.

IV. Online Optimization 

In this section we consider the online optimization framework,
in which we assume that the transmitter knows the
state transition probabilities $p_{x_i}(s_j, s_k)$, the immediate reward
function $R_{x_i}(s_j, s_k)$, and additionally has causal information
of the state of the system $S_n$. We employ policy iteration
(PI) [25], a DP algorithm, to find the optimal policy in (3). The
MDP problem in (3) has finite action and state spaces as well
as bounded and stationary immediate reward functions. Under
these conditions PI is proven to converge to the optimal policy
when $0 \le \gamma < 1$ [25]. The key idea is to use the structure of
(5), (6) and (7) to obtain the optimal policy. PI is based on
two steps: 1) policy evaluation, and 2) policy improvement.

Figure 2. Transmitter interaction with the system stochastic processes
forming an MDP.

In the policy evaluation step the value of a policy $\pi$ is evaluated
by computing the value function $V^{\pi}(s_j)$. In principle, (5)
is solvable but at the expense of laborious calculations when
$\mathcal{S}$ is large. Instead, PI uses an iterative method [24]: given $\pi$,
$p_{x_i}(s_j, s_k)$ and $R_{x_i}(s_j, s_k)$, the state-value function $V^{\pi}(s_j)$
is estimated as

$$V_l^{\pi}(s_j) = \sum_{s_k} p_{\pi(s_j)}(s_j, s_k) \left[ R_{\pi(s_j)}(s_j, s_k) + \gamma V_{l-1}^{\pi}(s_k) \right], \tag{9}$$

for all $s_j \in \mathcal{S}$, where $l$ is the iteration number of the estimation
process. It can be shown that the sequence $V_l^{\pi}(s_j)$ converges
to $V^{\pi}(s_j)$ as $l \to \infty$ when $0 \le \gamma < 1$. With policy
evaluation, one evaluates how good a policy $\pi$ is by computing
its expected discounted reward at each state $s_j \in \mathcal{S}$.

In the policy improvement step, the PI algorithm looks
for a policy $\pi'$ that is better than the previously evaluated
policy $\pi$. The Policy Improvement Theorem [17] states that
if $Q^{\pi}(s_j, \pi'(s_j)) \ge V^{\pi}(s_j)$ for all $s_j \in \mathcal{S}$ then $\pi' \ge \pi$.
The policy improvement step finds the new policy $\pi'$ by applying
the greedy policy to $Q^{\pi}(s_j, x_i)$ in each state. Accordingly, the
new policy $\pi'$ is selected as follows:

$$\pi'(s_j) = \arg\max_{x_i \in \mathcal{A}} Q^{\pi}(s_j, x_i). \tag{10}$$



PI works iteratively by first evaluating $V^{\pi}(s_j)$, finding a
better policy $\pi'$, then evaluating $V^{\pi'}(s_j)$, and finding a better
policy $\pi''$, and so forth. When the same policy is found in
two consecutive iterations we conclude that the algorithm
has converged. The exact embodiment of the algorithm, as
described in [24], is given in Algorithm 1. The performance
of the proposed algorithm and the comparison with other
approaches will be given in Section VII.

Remark 1. We want to point out that, different from the
solutions of the MDPs in [10], the optimal policy in our
problem does not have a "threshold" formulation, because
in our model the energy cost of transmitting each packet is
different and depends on the channel state as well as the packet
size.

Algorithm 1 Policy Iteration (PI)

1. Initialize:
    for each $s_j \in \mathcal{S}$ do
        initialize $V(s_j)$ and $\pi(s_j)$ arbitrarily
    end for

2. Policy evaluation:
    repeat
        $\Delta \leftarrow 0$
        for each $s_j \in \mathcal{S}$ do
            $v \leftarrow V(s_j)$
            $V(s_j) \leftarrow \sum_{s_k} p_{\pi(s_j)}(s_j, s_k) [R_{\pi(s_j)}(s_j, s_k) + \gamma V(s_k)]$
            $\Delta \leftarrow \max(\Delta, |v - V(s_j)|)$
        end for
    until $\Delta < \epsilon$

3. Policy improvement:
    policy-stable $\leftarrow$ true
    for each $s_j \in \mathcal{S}$ do
        $b \leftarrow \pi(s_j)$
        $\pi(s_j) \leftarrow \arg\max_{x_i \in \mathcal{A}} \sum_{s_k} p_{x_i}(s_j, s_k) [R_{x_i}(s_j, s_k) + \gamma V(s_k)]$
        if $b \neq \pi(s_j)$ then
            policy-stable $\leftarrow$ false
        end if
    end for

4. Check stopping criteria:
    if policy-stable then
        stop
    else
        go to 2).
    end if


V. Learning Theoretic Approach 

In this section we consider the problem setup in Section III
assuming that the transmitter has no knowledge of the
transition probabilities $p_{x_i}(s_j, s_k)$ and the immediate reward
function $R_{x_i}(s_j, s_k)$. We use Q-learning, a learning technique
originating from RL, to find the optimal transmission policy.
Q-learning relies only on the assumption that the underlying
system can be modeled as an MDP and that in each learning
iteration, after taking action $X_n$, the immediate reward
$R_{X_n}(S_n, S_{n+1})$ as well as the state of the system $S_{n+1}$
are known causally. The immediate reward, in our particular
problem, is the size of the transmitted packet $D_n$; hence, it
is readily known at the transmitter. Eqn. (6) indicates that
$Q^{\pi}(S_n, x_i)$ of the current state-action pair can be represented
in terms of the expected immediate reward of the current
state-action pair and the state-value function $V^{\pi}(S_{n+1})$ of the
next state. Note that $Q^{\pi^*}(s_j, x_i)$ contains all the long term
consequences of taking action $x_i$ in state $s_j$ when following
policy $\pi^*$. Thus, one can take the optimal actions by looking
only at $Q^{\pi^*}(s_j, x_i)$ and choosing the action that will yield the
highest expected reward (greedy policy). As a consequence, by
only knowing $Q^{\pi^*}(s_j, x_i)$, one can derive the optimal policy
$\pi^*$ without knowing $p_{x_i}(s_j, s_k)$ or $R_{x_i}(s_j, s_k)$. Based on this
relation, the Q-learning algorithm finds the optimal policy by
estimating $Q^{\pi^*}(s_j, x_i)$ in a recursive manner. In the $n$-th learning
iteration $Q^{\pi^*}(s_j, x_i)$ is estimated by $Q_n(s_j, x_i)$, which is
done by weighting the previous estimate $Q_{n-1}(s_j, x_i)$ and the
estimated expected value of the best action of the next state
$S_{n+1}$. In each TS, the algorithm

• observes the current state $S_n = s_j \in \mathcal{S}$,

• selects and performs an action $X_n = x_i \in \mathcal{A}$,

• observes the next state $S_{n+1} = s_k \in \mathcal{S}$ and the immediate
reward $R_{x_i}(s_j, s_k)$,

• updates its estimate of $Q^{\pi^*}(s_j, x_i)$ using

$$Q_n(s_j, x_i) = (1 - \alpha_n) Q_{n-1}(s_j, x_i) + \alpha_n \left[ R_{x_i}(s_j, s_k) + \gamma \max_{x_j \in \mathcal{A}} Q_{n-1}(s_k, x_j) \right], \tag{11}$$

where $\alpha_n$ is the learning rate factor in the $n$-th learning
iteration. If all actions are selected and performed with non-zero
probability, $0 \le \gamma < 1$ and the sequence $\alpha_n$ fulfills
certain constraints¹, the sequence $Q_n(s_j, x_i)$ is proven to
converge to $Q^{\pi^*}(s_j, x_i)$ with probability 1 as $n \to \infty$ [26].

With $Q_n(s_j, x_i)$ at hand the transmitter has to decide on a
proper transmission policy to follow. We recall that, in case
$Q^{\pi^*}(s_j, x_i)$ is perfectly estimated by $Q_n(s_j, x_i)$, the optimal
policy is the greedy policy. However, there might be inaccuracies
in the estimate of $Q^{\pi^*}(s_j, x_i)$ and as a consequence
the greedy policy is no longer optimal. In order to estimate
$Q^{\pi^*}(s_j, x_i)$ accurately, all actions in all states must be taken a
sufficient number of times. To this end, the transmitter balances
the exploration of new actions with the exploitation of known
actions. In exploitation the transmitter follows the greedy
policy; however, if only exploitation occurs, optimal actions
might remain unexplored. In exploration, by contrast, the transmitter
takes actions randomly with the final aim of discovering
better policies and enhancing the estimate of $Q^{\pi^*}(s_j, x_i)$. The
$\epsilon$-greedy action selection method for balancing exploration and
exploitation is as follows. At each learning iteration either
explore (take actions randomly) with probability $\epsilon$ or exploit
(follow the greedy policy) with probability $1 - \epsilon$, where
$0 < \epsilon < 1$.
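
The update rule and the $\epsilon$-greedy selection can be combined into a short Python sketch. It is a minimal illustration rather than the implementation evaluated in the paper. The environment is accessed through a hypothetical `step` function, the learning rate $(k+1)^{-0.7}$ for the $k$-th visit of a state-action pair is one possible choice whose sum diverges while the sum of its squares stays finite, and the two-state toy environment is our own.

```python
import random

def q_learning(states, actions, step, gamma, num_iters, epsilon, rng, start):
    """Tabular Q-learning with epsilon-greedy action selection.

    step(s, x) must return (next_state, reward); the transition
    probabilities and the reward function are never accessed directly.
    """
    Q = {(s, x): 0.0 for s in states for x in actions}
    visits = {(s, x): 0 for s in states for x in actions}
    s = start
    for _ in range(num_iters):
        if rng.random() < epsilon:
            x = rng.choice(actions)                      # explore
        else:
            x = max(actions, key=lambda a: Q[(s, a)])    # exploit (greedy)
        s_next, reward = step(s, x)
        visits[(s, x)] += 1
        # Decaying learning rate for this state-action pair (assumed choice).
        alpha = 1.0 / (1 + visits[(s, x)]) ** 0.7
        target = reward + gamma * max(Q[(s_next, a)] for a in actions)
        Q[(s, x)] = (1 - alpha) * Q[(s, x)] + alpha * target
        s = s_next
    policy = {st: max(actions, key=lambda a: Q[(st, a)]) for st in states}
    return Q, policy

# Hypothetical toy environment: action 0 stays, action 1 moves to the other
# state, and a unit reward is earned only by staying in state 1.
def toy_step(s, x):
    s_next = s if x == 0 else 1 - s
    reward = 1.0 if (s == 1 and x == 0) else 0.0
    return s_next, reward

Q_est, learned_policy = q_learning(
    [0, 1], [0, 1], toy_step, gamma=0.5,
    num_iters=20000, epsilon=0.3, rng=random.Random(0), start=0,
)
```

In this toy environment the greedy policy extracted from the learned table should move from state 0 to state 1 and then stay there, matching what DP would return if the model were known.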

For more details about RL and Q-learning the reader is
referred to [24] and [27]. The specific embodiment of Q-learning
is presented in Algorithm 2. In Section VII the
performance of Q-learning in our problem setup is evaluated
and compared to other approaches.

¹The constraints on the learning rate $\alpha_n$ follow from well-known results
in stochastic approximation theory. Denote by $\alpha_{n^k(s_j, x_i)}$ the learning rate
$\alpha_n$ corresponding to the $k$-th time action $x_i$ is selected in state $s_j$. The
constraints on $\alpha_n$ are $0 < \alpha_{n^k(s_j, x_i)} < 1$, $\sum_{k=0}^{\infty} \alpha_{n^k(s_j, x_i)} = \infty$ and
$\sum_{k=0}^{\infty} \alpha^2_{n^k(s_j, x_i)} < \infty$, $\forall s_j \in \mathcal{S}$ and $\forall x_i \in \mathcal{A}$. The second condition is
required to guarantee that the algorithm's steps are large enough to overcome
any initial condition. The third condition guarantees that the steps become
small enough to assure convergence. Although the use of sequences $\alpha_n$ that
meet these conditions assures convergence in a theoretical framework, they
are rarely used in practical applications.

Algorithm 2 Q-learning

1. Initialize:
    for each $s_j \in \mathcal{S}$, $x_i \in \mathcal{A}$ do
        initialize $Q(s_j, x_i)$ arbitrarily
    end for
    evaluate the starting state $s_j \leftarrow S_0$

2. Learning:
    loop
        select action $X_n$ following the $\epsilon$-greedy action selection method
        perform action $x_i \leftarrow X_n$
        observe the next state $s_k \leftarrow S_{n+1}$
        receive an immediate reward $R_{x_i}(s_j, s_k)$
        select the action $x_j$ corresponding to $\max_{x_j} Q(s_k, x_j)$
        update the $Q(s_j, x_i)$ estimate as follows:
            $Q(s_j, x_i) \leftarrow (1 - \alpha_n) Q(s_j, x_i) + \alpha_n [R_{x_i}(s_j, s_k) + \gamma \max_{x_j} Q(s_k, x_j)]$
        update the current state $s_j \leftarrow s_k$
    end loop

VI. Offline Optimization

In this section we consider the problem setup in Section III
assuming that all the future data/energy arrivals as well as the
channel variations are known non-causally at the transmitter
before the transmission starts. Offline optimization is relevant
in applications for which the underlying stochastic processes
are deterministic and are known at the transmitter. In general,
the solution of the corresponding offline optimization problem
can be considered as an upper bound on the performance of the
online and the learning theoretic problems. The offline approach
optimizes the transmission policy over a realization of the
MDP for a finite number of TSs, whereas the learning theoretic
and online optimization approaches optimize the expected
value over an infinite horizon. We recall that an MDP realization
is a sequence of state transition realizations of the data
and energy harvesting as well as the channel state processes
for a finite number of TSs. Given an MDP realization, in the
offline optimization approach we optimize $X_n$ such that the
discounted sum of transmitted data is maximized. From (3)
the offline optimization problem can be written as follows



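$$
\begin{aligned}
\max_{\mathbf{X}, \mathbf{B}} \quad & \sum_{n=0}^{N} \gamma^n X_n D_n && \text{(12a)} \\
\text{s.t.} \quad & X_n E_n^T \le B_n, && \text{(12b)} \\
& B_{n+1} \le B_n - X_n E_n^T + E_n^H, && \text{(12c)} \\
& 0 \le B_n \le B_{max}, && \text{(12d)} \\
& X_n \in \{0, 1\}, \quad n = 0, \ldots, N, && \text{(12e)}
\end{aligned}
$$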

where $\mathbf{B} = [B_0, \ldots, B_N]$ and $\mathbf{X} = [X_0, \ldots, X_N]$. Note
that we have replaced the equality constraint in (2) with two
inequality constraints, namely (12c) and (12d). Hence, the
problem in (12) is a relaxed version of (3). To see that the
two problems are indeed equivalent, we need to show that
any solution to (12) is also a solution to (3). If the optimal
solution to (12) satisfies (12c) or (12d) with equality, then it
is a solution to (3) as well. Assume that $\mathbf{X}$, $\mathbf{B}$ is an optimal
solution to (12) and that for some $n$, $B_n$ fulfills both of the
constraints (12c) and (12d) with strict inequality whereas the
other components satisfy at least one constraint with equality.
In this case, we can always find a $B_n^+ > B_n$ such that at
least one of the constraints is satisfied with equality. Since
$B_n^+ > B_n$, (12b) is not violated and $\mathbf{X}$ remains
feasible, achieving the same objective value. In this case, $\mathbf{X}$
is feasible and a valid optimal solution to (3) as well, since
$B_n^+$ satisfies (2).

The problem in (12) is a mixed integer linear programming
problem (MILP) since it has affine objective and constraint
functions, while the optimization variable $X_n$ is constrained
to be binary. This problem is known to be NP-hard; however,
there are algorithms combining relaxation tools with smart
exhaustive search methods to reduce the solution time. Notice
that, if one relaxes the binary constraint on $X_n$ to $0 \le X_n \le 1$,
(12) becomes a linear programming (LP) problem. We call the
optimization problem in (12) the complete-problem and its
relaxed version the LP-problem. We define $\mathcal{O}$ as the feasible
set for the complete-problem and $\mathcal{R}$ as the feasible set for the
LP-problem. Two properties are of interest. First, since $\mathcal{O}$ is a
subset of $\mathcal{R}$, the optimal value of the LP-problem provides an
upper bound on the complete-problem. Secondly, if an optimal
solution of the LP-problem belongs to $\mathcal{O}$ it is also an optimal
solution to the complete-problem.

Most available MILP solvers employ an LP based branch- 
and-bound algorithm (28). Branch-and-bound |29| works by 
generating disjunctions; that is to partition the feasible set O of 
the complete-problem into smaller subsets Ok and to explore 
each subset Ok recursively. The algorithm maintains a list C 
of active subproblems over all the active subsets Ok created. 
Let CsP(A:) be the active subproblem over the fc-th subset 
Ok- The objective value of any feasible solution to CsP(/c) is 
a lowerbound to the objective value of the complete-problem. 
The feasible solution along all the subproblems CsP(fc) with 
the highest objective is called the incumbent and its objective 
value is denoted by I max . Let X fc be the optimal solution, 
and Ik its objective value corresponding to the LP-problem 
version of CsP(/c). There are three options: 1) If X fc <E Ok, the 
complete problem and the LP-problem have the same solution. 
We update I max = max{4,/ maj .} and all subproblems in 
C such that Ik < I m ax are discarded; 2) If X fe £ O k and 
Ik < Imax, then the optimal solution of CsP(fc) can not 
improve I max and the subproblem CsP(fc) is discarded, and 
3) If X fc ^ Ok and I k > I max , then CsP(fc) requires further 
exploration, which is done by branching, i.e., creating two 
new subproblems of CsP(fc) by dividing its feasible set Ok- 
A simple branching procedure is as follows. Assume that the 
n-th element of X fc , denoted by X„, is not in Ok, then we 
can formulate a logical disjunction for the n-th element of the 
optimal solution X n £ Ok as 



X n < LX£j ORX n > fx*i, 



(13) 



The problem in (12 I is a mixed integer linear optimization 



where ["■] and ['J are the integer upper and lower parts, 
respectively. With this logical disjunction the algorithm creates 
two new subsets O^ and Ok", one associated with each of the 
linear constraints, which divide Ok in two. The two subprob- 
lems, CsP(fc') and CsP(A:"), associated to the new subsets 
Ok' and Ok", respectively, replace CsP(/c) in C. Notice that, 
with the binary constraints of our particular setting, the logical 
disjunction (13i becomes X n — OR X n = 1 and the new 
subproblems, CsP(fc') and CsP(fc"), will assign X n to either 
zero or one, respectively. The highest optimal value of the 
LP-problem version associated with the active subproblems 
in £ is a valid upperbound on the complete-problem. The 
algorithm terminates when the incumbent and the upperbound 
are equal, in which case C is empty. The basic branch-and- 
bound algorithm is given in Algorithm [3] In our numerical 
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Algorithm 3 Branch and Bound 







1. Initialize: 

£ = {complete problem in £>}, I m 

2. Terminate: 
if L = then 

X-max is optimal 
end if 

3. Select: 

choose and delete a subproblem CsP(fc) form C 

4. Evaluate: 

solve a LP-problem version of CsP(fe) 
if LP-problem is infeasible then 

go to Step 2 
else 

let Ik be its optimal value and Xfe its solution 
end if 

5. Prune: 

if I\ < Imax then 

go to Step 2 
else if X fe ^ Ok then 

go to Step 6 
else 

Imax Ik fnd X ma!c Xfe 

delete all subproblems CsP(n) in C with I n < Imax 
end if 

6. Branch: 

divide CsP(fc) in two subproblems, CsP(fc') and CsP(A:'' 

add them to C 

set I k > «- Jfe and I V i «- I k 

go to Step 3 



and 



analysis in Section VII we provide the optimal performance 
for the offline optimization approach using the above branch- 
and-bound algorithm as well as the upperbound derived using 
the LP relaxation. 
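
The select, prune and branch loop described for the offline problem can be illustrated with a minimal Python sketch. It is not the toolbox used for the reported results, and it swaps the LP-relaxation bound for a much simpler optimistic bound (all remaining packets are assumed transmittable, i.e., the energy constraints are ignored), so it prunes less aggressively than an LP based solver would. The function name, the argument names and the convention that energy harvested in TS n becomes available in TS n+1 are assumptions made for this illustration.

```python
def branch_and_bound(D, E_T, E_H, B0, B_max, gamma):
    """Maximize sum_n gamma^n * X_n * D_n over binary X under the battery constraints.

    D[n]   : packet size in TS n
    E_T[n] : energy needed to transmit packet n
    E_H[n] : energy harvested during TS n (assumed usable from TS n+1 on)
    """
    N = len(D)
    # tail[n] = sum_{m >= n} gamma^m * D[m]; an optimistic bound that ignores
    # energy limits (a simpler stand-in for the LP-relaxation bound).
    tail = [0.0] * (N + 1)
    for n in range(N - 1, -1, -1):
        tail[n] = tail[n + 1] + gamma ** n * D[n]

    best_value, best_X = float("-inf"), None   # incumbent
    # Each active subproblem fixes X_0..X_{n-1}: (n, battery, value so far, X so far)
    active = [(0, B0, 0.0, [])]
    while active:
        n, battery, value, X = active.pop()    # select
        if value + tail[n] <= best_value:      # prune: cannot beat the incumbent
            continue
        if n == N:                             # all decisions fixed: new incumbent
            best_value, best_X = value, X
            continue
        # branch on X_n = 0
        active.append((n + 1, min(battery + E_H[n], B_max), value, X + [0]))
        # branch on X_n = 1, only if the battery can pay for the packet
        if E_T[n] <= battery:
            active.append((n + 1, min(battery - E_T[n] + E_H[n], B_max),
                           value + gamma ** n * D[n], X + [1]))
    return best_value, best_X


# Toy realization (made-up numbers, three TSs).
value, X = branch_and_bound(D=[1, 2, 2], E_T=[1, 2, 4], E_H=[0, 2, 0],
                            B0=2, B_max=5, gamma=0.9)
```

On this toy instance the search settles on transmitting only the second packet; replacing the bound `value + tail[n]` by the optimal value of an LP relaxation would bring the sketch closer to Algorithm 3.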



Remark 2. The LP relaxation of (12) corresponds to the problem
in which the transmitter does not make binary decisions,
and is allowed to transmit the packets partially. It is assumed
in this case that transmitting a fraction $\alpha$ of packet $D_n$, with
$0 \le \alpha \le 1$, requires $\alpha E_n^{T}$ energy. In principle, DP and RL
ideas can also be applied to problems with continuous state
and action spaces; however, exact solutions are possible only
in special cases. A common way of obtaining approximate
solutions with continuous state and action spaces is to use
function approximation techniques [24].
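
For concreteness, the LP-problem discussed in Remark 2 can be written out by replacing the binary constraint of (12) with a box constraint. The block below is only a restatement under the notation assumed here, with $E_n^{T}$ the energy needed to transmit the packet of TS $n$ and $E_n^{H}$ the energy harvested in TS $n$; it is not an additional result.

```latex
\begin{aligned}
\max_{\mathbf{X},\,\mathbf{B}} \quad & \sum_{n=0}^{N} \gamma^{n} X_n D_n \\
\text{s.t.} \quad & X_n E_n^{T} \le B_n, \\
& B_{n+1} \le B_n - X_n E_n^{T} + E_n^{H}, \\
& 0 \le B_n \le B_{\max}, \\
& 0 \le X_n \le 1, \quad n = 0, \ldots, N.
\end{aligned}
```

Every binary solution is feasible here, which is why the optimal value of this problem upper bounds that of the binary problem.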

Remark 3. Notice that, unlike the online and learning theoretic
optimization, the offline optimization approach is not restricted
to the case where $0 \le \gamma < 1$. Hence, both the
branch-and-bound algorithm and the LP relaxation can be applied to the
throughput optimization problem in (4).

VII. Numerical Results 

To compare the performance of the three approaches that we
have proposed, we focus on a sample scenario of the EH
communication system presented in Section III. We are interested
in comparing the expected performance of the approaches
proposed. For the online optimization approach it is possible to
evaluate the expected performance of the optimal policy $\pi^{*}$,
found using the DP algorithm, by solving (5) or evaluating
(9) and averaging over all possible starting states $S_0 \in \mathcal{S}$. In
theory, the learning theoretic approach must achieve the same
performance as the online optimization approach²; however,
in practice the transmitter can learn only for a finite number
of TSs and the transmission policy it arrives at depends on
the specific realization of the MDP. The offline optimization
approach optimizes over a realization of the MDP. To find the
expected performance of the offline optimization approach one
has to average over infinite realizations of the MDP for an
infinite number of TSs.

In practice, to evaluate the expected performance of the
proposed algorithms we average the achieved performance
of the different approaches over a finite set of MDP realizations
for a finite number of TSs. Equivalently, we assume
that the performance of the proposed algorithms is a random 
variable and use the sample mean to estimate its expected 
value. Accordingly, to provide a measure of accuracy for our 
estimators, we also compute the confidence intervals. The 
details of the confidence interval computations are relegated 
to the Appendix. 

In our numerical analysis the following setup is considered
for the stochastic processes governing the energy harvesting,
the data arrival and the channel state. We assume that the
transmitter at each TS either harvests two units of energy or
does not harvest any, i.e., $\mathcal{E} = \{0, 2\}$. We denote $p_e(2,2)$,
the probability of harvesting two energy units in TS $n$ given
that the same amount was harvested in TS $n-1$, by $p_H$. In
our simulations we will study the effect of $p_H$ on the system
performance and the convergence behavior of the learning
algorithm. We set $p_e(0,0)$, the probability of not harvesting
any energy in TS $n$ when no energy was harvested in TS $n-1$,
to 0.9. The battery size is set to $B_{\max} = 5$ energy units. The
possible packet sizes are $D_n \in \mathcal{D} = \{1, 2\}$ data units with
state transition probabilities $p_d(1,1) = p_d(2,2) = 0.9$. The
channel state at TS $n$ is $H_n \in \mathcal{H} = \{1, \tfrac{1}{2}\}$, with state
transition probabilities $p_h(1,1) = p_h(\tfrac{1}{2},\tfrac{1}{2}) = 0.9$.
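
The simulation setup above can be mirrored by a short Python sketch that draws one MDP realization from three independent two-state Markov chains. The function names are hypothetical, and the uniformly random initial state is an assumption, since the text does not specify how realizations are initialized.

```python
import random


def simulate_chain(states, stay_prob, n_steps, rng):
    """Two-state Markov chain: remain in `state` with probability stay_prob[state]."""
    state = rng.choice(states)            # assumed: uniformly random initial state
    path = [state]
    for _ in range(n_steps - 1):
        if rng.random() >= stay_prob[state]:
            # leave the current state, i.e., move to the other one
            state = states[1] if state == states[0] else states[0]
        path.append(state)
    return path


def generate_realization(n_steps, p_H, seed=0):
    """Return (harvested energy, packet size, channel state) sequences."""
    rng = random.Random(seed)
    energy = simulate_chain((0, 2), {0: 0.9, 2: p_H}, n_steps, rng)
    data = simulate_chain((1, 2), {1: 0.9, 2: 0.9}, n_steps, rng)
    channel = simulate_chain((1.0, 0.5), {1.0: 0.9, 0.5: 0.9}, n_steps, rng)
    return energy, data, channel


energy, data, channel = generate_realization(n_steps=101, p_H=0.5, seed=1)
```

Fixing the seed makes a realization reproducible, which is convenient when the same realization has to be fed to several policies for comparison.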

To find the required energy to reliably transmit a data
packet over the channel we consider Shannon's capacity
formula for Gaussian channels. The transmitted data in TS $n$
of duration $\Delta$ is

$$
D_n = \Delta \log_2 (1 + H_n P), \qquad \text{(14)}
$$

where $P$ is the transmit power. In the low power regime, which
is of special practical interest in the case of energy harvesting
devices, the capacity formula can be approximated as linear,
i.e., $D_n \approx \Delta H_n P$, where $\Delta P$ is the energy expended in
transmission in TS $n$; and, since the energy expended is
measured in energy units, $D_n$ is measured in data units. Then,
the minimum energy required for transmitting a packet $D_n$ is
given by $E_n^{T} = f_e(D_n, H_n) = \frac{D_n}{H_n}$. In general we assume that
the transmit energy for each packet at each channel state is an
integer multiple of the energy unit. In our special case this
condition is satisfied as we have $\mathcal{E}_u = \{1, 2, 4\}$. Numerical results
for the discounted sum data problem, in which the transmitter
might terminate its operation, are given in Section VII-A for
$\gamma = 0.9$, whereas the throughput optimization problem ($\gamma = 1$)
is examined in Section VII-B.
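
As a quick check of the integer-multiple condition, the following Python sketch enumerates the transmit energies for the packet sizes and channel states of this setup under the linear approximation, where the required energy is the packet size divided by the channel state. The helper name `transmit_energy` is hypothetical.

```python
from fractions import Fraction

packet_sizes = (1, 2)                               # data units
channel_states = (Fraction(1), Fraction(1, 2))      # channel gains


def transmit_energy(d, h):
    """Energy units needed to send a packet of size d over channel state h."""
    return Fraction(d) / h


energy_levels = sorted({transmit_energy(d, h)
                        for d in packet_sizes for h in channel_states})
# Every level is an integer number of energy units.
all_integer = all(e.denominator == 1 for e in energy_levels)
```

Exact fractions are used instead of floats so that the integrality check does not depend on rounding.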

2 This is true only if $0 \le \gamma < 1$.

A. Discounted Sum Data Problem

For evaluation and comparison purposes we generate $T = 2000$
realizations of $N = 100$ random state transitions and
examine the performance of the proposed algorithms. In particular
we consider the LP relaxation of the offline optimization
problem, the offline optimization problem with the
branch-and-bound algorithm³, the online optimization problem with
PI, the learning theoretic approach with Q-learning⁴ and,
finally, a greedy algorithm which assumes causal knowledge
of $B_n$, $D_n$ and $H_n$, and transmits a packet whenever there is
enough energy in the battery.
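
The greedy baseline is simple enough to sketch in a few lines of Python. The sketch assumes the linear energy model (required energy equals packet size divided by channel state), that energy harvested in TS n is usable from TS n+1 on, and an initially empty battery; the function name and defaults are illustrative only.

```python
def greedy_policy(energy, data, channel, B_max=5, gamma=0.9, B0=0):
    """Transmit whenever the battery holds enough energy for the current packet."""
    battery = B0
    total = 0.0
    actions = []
    for n, (e_h, d, h) in enumerate(zip(energy, data, channel)):
        e_t = d / h                          # energy needed for this packet
        x = 1 if e_t <= battery else 0       # greedy decision
        total += (gamma ** n) * x * d        # discounted transmitted data
        battery = min(battery - x * e_t + e_h, B_max)
        actions.append(x)
    return total, actions


# Toy realization with three TSs (made-up numbers).
total, actions = greedy_policy(energy=[2, 2, 0], data=[1, 2, 1],
                               channel=[1.0, 0.5, 1.0])
```

In this toy run the first two packets are dropped for lack of energy and only the third one is transmitted.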

Notice that the LP relaxation solution is an upper bound on 
the performance of the offline optimization problem, which, 
in turn, is an upper bound on the online problem. At the 
same time the performance of the online optimization problem 
is an upper bound on the learning theoretic and the greedy 
approaches. 

In Figure 3 we illustrate, together with the performance of
the other approaches, the expected sum of transmitted data
achieved by the learning theoretic approach against the time
evolution. We can see that after 200 TSs the learning algorithm
reaches 90% of the performance achieved by online optimization,
while after $2 \cdot 10^5$ TSs the performance is 99.5% of the optimal.
We can conclude that the learning theoretic approach is able
to learn the optimal policy as the number of TSs increases.
We also observe from Figure 3 that the performance of the
greedy algorithm is notably inferior compared to the other
approaches.



[Plot omitted. Curves: Offline-LP, Offline, Online, Learning, Greedy. Horizontal axis: time evolution (TS), x1000.]

Figure 3. Expected value of the discounted sum of transmitted data over the
learning time of the learning theoretic approach, $p_H = 0.5$ and $\gamma = 0.9$.

Figure 4 displays the expected sum of transmitted data
for different $p_H$ values. For the learning theoretic approach
we show the performance after having learned for $10^4$ TSs,
since we consider that after this learning time the learning
algorithm has been able to learn a transmission policy close to
the optimal. As expected, the performance of all the approaches
increases as the average amount of harvested energy increases
with $p_H$. It can be seen that the online approach achieves 80%
of the performance of the offline approach when $p_H = 0.5$,
while for $p_H = 0.9$ it reaches 89%. This is due to the fact that
the underlying Markov process governing the energy arrivals
becomes less random as $p_H$ increases; hence, the online
algorithm can better estimate its future states and adapt to them.

3 Reference [28] presents a survey on software tools for MILP problems.
In this paper we use the branch-and-bound toolbox provided in [30].

Since the learning theoretic approach is upper bounded by
the online optimization approach it has a similar behavior.
Its performance achieves 90% of the online optimization for
$p_H = 0.5$ and 97% for $p_H = 0.9$. The Q-learning algorithm
learns faster and performs better when the underlying Markov
processes are less random.

Additionally, we observe from Figure 4 that the performance
of the greedy approach is about 50% of the offline approach.




Figure 4. Expected value of the discounted sum of transmitted data for
$p_H = \{0.5, \ldots, 0.9\}$ and $\gamma = 0.9$.



B. Throughput Optimization Problem 

In the online and learning theoretic formulations, the
throughput optimization problem in (4) falls into the category
of average reward maximization problems, which cannot be
solved with Q-learning unless a finite number of TSs is
specified, or the presence of absorbing states in the MDP is
considered. Alternatively, one can take advantage of average
reward RL algorithms. Nevertheless, the convergence properties
of these methods are not yet well understood. One such average
reward RL algorithm is R-learning [31], which, similarly to
Q-learning, estimates an adjusted version of the action-value
function in (6). In contrast to Q-learning, R-learning is
not proven to converge.

Similarly, for the online optimization problem the PI
algorithm cannot be used either, since the policy evaluation
step is not guaranteed to converge. Instead we use relative
value iteration (RVI) [32], which is a DP algorithm to find the
optimal policy in average reward MDP problems.
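
To make the RVI step concrete, the following Python sketch implements the textbook form of relative value iteration on a deliberately tiny two-state MDP. The MDP, the function name and the data layout are assumptions chosen for illustration; they are not the EH system model or the implementation used for the reported results.

```python
def relative_value_iteration(P, R, ref_state=0, tol=1e-9, max_iter=10_000):
    """Average-reward DP.

    P[s][a] is a list of (probability, next_state) pairs, R[s][a] the reward.
    Returns (gain estimate, relative values h, greedy policy).
    """
    n_states = len(P)
    h = [0.0] * n_states
    gain = 0.0
    for _ in range(max_iter):
        # one Bellman backup per state-action pair
        q = [[R[s][a] + sum(p * h[s2] for p, s2 in P[s][a])
              for a in range(len(P[s]))] for s in range(n_states)]
        backed_up = [max(q_s) for q_s in q]
        gain = backed_up[ref_state]              # current average-reward estimate
        h_new = [v - gain for v in backed_up]    # keep values relative to ref_state
        converged = max(abs(x - y) for x, y in zip(h_new, h)) < tol
        h = h_new
        if converged:
            break
    policy = [max(range(len(q[s])), key=lambda a: q[s][a]) for s in range(n_states)]
    return gain, h, policy


# Toy MDP: action 0 = stay (reward 1 in state 0, reward 2 in state 1),
# action 1 = switch to the other state (reward 0).
P = [[[(1.0, 0)], [(1.0, 1)]],
     [[(1.0, 1)], [(1.0, 0)]]]
R = [[1.0, 0.0],
     [2.0, 0.0]]
gain, h, policy = relative_value_iteration(P, R)
```

Subtracting the value at a reference state in every iteration is what keeps the iterates bounded when there is no discounting; for this toy MDP the best long-run behaviour is to move to state 1 and stay there.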

In our numerical analysis for the throughput optimization
problem we consider the LP relaxation of the offline
optimization problem, the offline optimization problem with the
branch-and-bound algorithm, the online optimization problem
with RVI, the learning theoretic approach with R-learning⁵
and, finally, the greedy algorithm. For evaluation purposes we
average over $T = 2000$ realizations of $N = 100$ random state
transitions.

In Figure 5 we illustrate, together with the performance of
the other approaches, the throughput achieved by the learning
theoretic approach against the time evolution. We observe that
after 200 TSs the learning algorithm reaches 89% of the
performance achieved by online optimization, while after
$2 \cdot 10^5$ TSs the performance is 93% of the performance of the
online optimization approach. Notably, the performance of the
learning theoretic approach increases with the number of
learning iterations; however, in this case the performance does
not converge to the performance of the online optimization
approach. We also observe from Figure 5 that the performance
of the greedy algorithm is notably inferior compared to the
other approaches.

4 We use the $\epsilon$-greedy action selection mechanism with $\epsilon = 7\%$ and set the
learning rate to $\alpha = 0.5$.

5 We use the same action selection method as Q-learning in Section VII-A.



[Plot omitted. Curves: Offline-LP, Offline, Online, Learning, Greedy. Horizontal axis: time evolution (TS), x1000.]

Figure 5. Average throughput over the learning time of the learning theoretic
approach for $p_H = 0.5$ and $\gamma = 1$.

Figure 6 displays the throughput for different $p_H$ values.
Similarly to Section VII-A we show the learning theoretic
approach performance after having learned for $10^4$ TSs. As
expected, the performance of all the approaches increases as the
average amount of harvested energy increases with $p_H$. It
can be seen that the online approach achieves 92% of the
performance of the offline approach when $p_H = 0.5$, while
for $p_H = 0.9$ it reaches 96%. This is in line with our finding
in Figure 4. The throughput achieved by the learning theoretic
approach reaches 92% of the online optimization throughput for
$p_H = 0.5$ and 97% for $p_H = 0.9$. Similarly to the
Q-learning algorithm in Figure 4, the R-learning performance,
compared to online and offline optimizations, increases when
the underlying Markov processes are less random. Similarly to
the discounted sum data problem, the greedy algorithm shows
a performance well below the other approaches. In general we
observe that, despite the fact that the convergence properties
are not well understood when $\gamma = 1$, the R-learning algorithm
has a similar behavior to Q-learning.




Figure 6. Average throughput for $p_H = \{0.5, \ldots, 0.9\}$ and $\gamma = 1$.



VIII. Conclusions 

We have considered a point-to-point communication system 
with an EH wireless transmitter, a rechargeable battery with 
limited capacity and strict deadline constraints. Our model 
includes stochastic data/energy arrivals and time varying chan- 
nel, all modeled by Markov processes. We have identified 
the discounted sum data problem; that is, the problem of 
maximizing the amount of data transmitted during the trans- 
mitter's lifetime. Regarding the information available at the 
transmitter about the underlying stochastic processes, online, 
learning theoretic and offline optimization approaches have 
been studied. For the learning theoretic and the online
optimization approaches the communication system is modeled as
an MDP, and the corresponding optimal transmission policies
have been identified. It has been shown that the learning
theoretic approach reaches the optimal performance of the online
optimization approach as the learning time goes to infinity.
The offline optimization problem has been identified to be a
mixed-integer linear programming problem, and the optimal
as well as the linear-programming relaxation-based solutions
have been found. Our numerical results have shown that, after
$10^4$ learning iterations, the learning theoretic approach reaches
more than 90% of the performance of the online optimization
approach. Accordingly, we have shown that smart and
energy-aware transmission policies can raise the performance from
50% up to 90% of the performance of the offline
optimization approach. In addition to addressing the discounted
sum data problem we have also addressed the throughput 
optimization problem and made similar observations despite 
the lack of theoretical convergence results. 

Appendix 



In the discounted sum data problem we are interested in
estimating $\bar{X} = \mathbb{E}\left[\lim_{N \to \infty} \sum_{n=0}^{N} \gamma^{n} X_n D_n\right]$, where $X_n$ is
the action taken by the transmitter, which is computed using
either the offline optimization, the online optimization or the
learning theoretic approach, and $D_n$ is the packet size in the
$n$-th TS. An upper bound on $\bar{X}$ can be found as

$$
\bar{X} \le \mathbb{E}\left[\sum_{n=0}^{N} \gamma^{n} X_n D_n\right] + \frac{D_{\max}\,\gamma^{N+1}}{1-\gamma} = \bar{X}_N + \epsilon_N, \qquad \text{(15)}
$$

which follows by assuming that after TS $N$ all packets arriving
at the transmitter are of size $D_{\max} \ge d_j$ for all $d_j \in \mathcal{D}$, that
there is enough energy to transmit all the arriving packets,
and that $0 \le \gamma < 1$. Notice that the error $\epsilon_N$ decreases as an
exponential function of $N$. Then $\bar{X}$ is constrained by

$$
\bar{X}_N \le \bar{X} \le \bar{X}_N + \epsilon_N. \qquad \text{(16)}
$$

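Now that we have gauged the error $\epsilon_N$ due to not considering
an infinite number of TSs in each MDP realization, we
consider next the error due to estimating $\bar{X}_N$ over a finite
number of MDP realizations. We can rewrite $\bar{X}_N$ as

$$
\bar{X}_N = \lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} \left( \sum_{n=0}^{N} \gamma^{n} X_n^{t} D_n^{t} \right), \qquad \text{(17)}
$$

where $X_n^{t}$ and $D_n^{t}$ correspond to the action taken and data
size in TS $n$ of the $t$-th MDP realization, respectively.
We denote by $\hat{X}_N^{T}$ the sample mean estimate of $\bar{X}_N$ for $T$
realizations:

$$
\hat{X}_N^{T} = \frac{1}{T} \sum_{t=0}^{T-1} \left( \sum_{n=0}^{N} \gamma^{n} X_n^{t} D_n^{t} \right). \qquad \text{(18)}
$$

Using the Central Limit Theorem, if $T$ is large, we can assume
that $\hat{X}_N^{T}$ is a random variable with normal distribution, and
by applying the Tchebycheff inequality we can compute the
confidence intervals for $\hat{X}_N^{T}$:

$$
P\left(\hat{X}_N^{T} - \epsilon_T < \bar{X}_N < \hat{X}_N^{T} + \epsilon_T\right) = \delta, \qquad \text{(19)}
$$

where $\epsilon_T = t_{\frac{1+\delta}{2}}(T)\, \frac{\hat{\sigma}}{\sqrt{T}}$, with $t_a(b)$ denoting the Student-t $a$
percentile for $b$ samples, and the variance $\hat{\sigma}^2$ is estimated from
the sample variance of the $T$ per-realization sums
$\sum_{n=0}^{N} \gamma^{n} X_n^{t} D_n^{t}$ (20).

Finally, the confidence interval for the estimate $\hat{X}_N^{T}$ of $\bar{X}$ is

$$
P\left(\hat{X}_N^{T} - \epsilon_T < \bar{X} < \hat{X}_N^{T} + \epsilon_T + \epsilon_N\right) = \delta. \qquad \text{(21)}
$$

In our numerical analysis we compute the confidence intervals
for $\delta = 0.9$.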
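
The two error terms of the Appendix can be computed with a few lines of Python. This is a sketch under stated assumptions: the truncation error is taken to be the geometric tail that results when every packet after TS N has the maximum size, and the Student-t percentile is replaced by the standard normal percentile, which is a close approximation only when the number of realizations is large. The function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def truncation_error(gamma, N, d_max):
    """Geometric tail d_max * sum_{n > N} gamma^n (assumed form of the truncation error)."""
    return d_max * gamma ** (N + 1) / (1 - gamma)


def confidence_interval(samples, delta=0.9):
    """Return (sample mean, half width) of a two-sided interval at level delta.

    `samples` holds one discounted sum per MDP realization. The normal
    percentile stands in for the Student-t percentile (large-sample assumption).
    """
    T = len(samples)
    center = mean(samples)
    z = NormalDist().inv_cdf((1 + delta) / 2)
    half_width = z * stdev(samples) / math.sqrt(T)
    return center, half_width


eps_N = truncation_error(gamma=0.9, N=100, d_max=2)
center, eps_T = confidence_interval([1.0, 2.0, 3.0, 4.0], delta=0.9)
```

With a discount factor of 0.9 and 100 TSs per realization the truncation term is already well below one thousandth of a data unit, so the interval width is dominated by the sampling term.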

Remark 4. In the throughput optimization problem we assume
that, given the stationarity of the underlying Markov
processes, the expected throughput achieved in a sufficiently
large number of TSs is the same as the expected throughput
over an infinite horizon. Thus, by setting $\epsilon_N$ to zero, the
computation of the confidence intervals for the throughput
problem is analogous to the discounted sum data problem.

References 

[1] J. Yang and S. Ulukus, "Optimal packet scheduling in an energy
harvesting communication system," IEEE Trans. Commun., vol. 60,
no. 1, pp. 220-230, Jan. 2012.

[2] B. Devillers and D. Gündüz, "A general framework for the optimization
of energy harvesting communication systems," J. of Commun. and
Networks, Special Issue on Energy Harvesting in Wireless Networks,
vol. 14, no. 2, pp. 130-139, Apr. 2012.

[3] O. Orhan, D. Gunduz, and E. Erkip, "Throughput maximization for an
energy harvesting communication system with processing cost," in IEEE
Information Theory Workshop (ITW), Lausanne, Switzerland, Sep. 2012.

[4] K. Tutuncuoglu and A. Yener, "Sum-rate optimal power policies for
energy harvesting transmitters in an interference channel," J. of Commun.
and Networks, Special Issue on Energy Harvesting in Wireless Networks,
vol. 14, no. 2, pp. 151-161, Apr. 2012.

[5] M. A. Antepli, E. Uysal-Biyikoglu, and H. Erkal, "Optimal packet
scheduling on an energy harvesting broadcast link," IEEE J. Sel. Areas
Commun., vol. 29, no. 8, pp. 1712-1731, Sep. 2011.

[6] C. Huang, R. Zhang, and S. Cui, "Throughput maximization for the
Gaussian relay channel with energy harvesting constraints," ArXiv
e-prints, Sep. 2011.

[7] D. Gunduz and B. Devillers, "Multi-hop communication with energy 
harvesting," in International Workshop on Computational Advances in 
Multi-Sensor Adaptive Processing ( CAMSAP), San Juan, PR, December 
2011. 

[8] O. Ozel, K. Tutuncuoglu, J. Yang, S. Ulukus, and A. Yener, "Transmis- 
sion with energy harvesting nodes in fading wireless channels: Optimal 
policies," IEEE J. Sel. Areas Commun., vol. 29, no. 8, pp. 1732-1743, 
Sep. 2011. 



[9] M. Gregori and M. Payaro, "Optimal power allocation for a wireless 
multi-antenna energy harvesting node with arbitrary input distribution," 
in International Workshop on Energy Harvesting for Communication 
(ICC 12 WS - EHC), Ottawa, Canada, Jun. 2012. 

[10] J. Lei, R. Yates, and L. Greenstein, "A generic model for optimizing 
single-hop transmission policy of replenishable sensors," IEEE Trans. 
Wireless Commun., vol. 8, no. 4, pp. 547-551, Apr. 2009. 

[11] Z. Wang, A. Tajer, and X. Wang, "Communication of energy harvesting
tags," IEEE Trans. Commun., vol. 60, no. 4, pp. 1159-1166, Apr. 2012.

[12] H. Li, N. Jaggi, and B. Sikdar, "Relay scheduling for cooperative 
communications in sensor networks with energy harvesting," IEEE 
Trans. Wireless Commun., vol. 10, no. 9, pp. 2918-2928, Sep. 2011. 

[13] Z. Mao, C. E. Koksal, and N. B. Shroff, "Near optimal power and rate 
control of multi-hop sensor networks with energy replenishment: Basic 
limitations with finite energy and data storage," IEEE Trans. Autom. 
Control, vol. 57, no. 4, pp. 815-829, Apr. 2012. 

[14] C. K. Ho and R. Zhang, "Optimal energy allocation for wireless
communications with energy harvesting constraints," ArXiv e-prints, Mar.
2011.

[15] A. Sinha and P. Chaporkar, "Optimal power allocation for a renewable
energy source," in Communications (NCC), 2012 National Conference
on, Kharagpur, India, Feb. 2012, pp. 1-5.

[16] R. Srivastava and C. E. Koksal, "Basic tradeoffs for energy management
in rechargeable sensor networks," submitted to IEEE/ACM Trans. Netw.,
Jan. 2011.

[17] R. E. Bellman, Dynamic Programming. Princeton, N.J.: Princeton
University Press, 1957.

[18] A. Kansal and M. B. Srivastava, "An environmental energy harvesting
framework for sensor networks," in International Symposium on Low
Power Electronics and Design (ISLPED), Tegernsee, Germany, Aug.
2003.

[19] J. Hsu, A. Kansal, S. Zahedi, M. B. Srivastava, and V. Raghunathan,
"Adaptive duty cycling for energy harvesting systems," in International
Symposium on Low Power Electronics and Design (ISLPED), Seoul,
Korea, Oct. 2006, pp. 180-185.

[20] C. M. Vigorito, D. Ganesan, and A. G. Barto, "Adaptive control of
duty cycling in energy-harvesting wireless sensor networks," in IEEE
Communications Society Conference on Sensor, Mesh and Ad Hoc
Communications and Networks (SECON), San Diego, CA, USA, 2007,
pp. 21-30.

[21] C. H. Roy, C.-T. Liu, and W.-M. Lee, "Reinforcement learning-based 
dynamic power management for energy harvesting wireless sensor 
network," in Next-Generation Applied Intelligence, ser. Lecture Notes 
in Computer Science, B.-C. Chien, T.-P. Hong, S.-M. Chen, and M. Ali, 
Eds. Springer Berlin / Heidelberg, 2009, vol. 5579, pp. 399-408. 

[22] A. Aprem, C. R. Murthy, and N. B. Mehta, "Transmit power control with
ARQ in energy harvesting sensors: A decision-theoretic approach," to
appear in IEEE Globecom 2012, Anaheim, CA, USA, Dec. 2012.

[23] R. E. Bellman, "A Markovian Decision Process," Journal of Mathemat- 
ical Mechanics, vol. 6, no. 5, pp. 679-684, 1957. 

[24] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction.
Cambridge, MA: MIT Press, 1998.

[25] R. Howard, Dynamic Programming and Markov Processes (Technology 
Press Research Monographs). The MIT Press, 1960. 

[26] C. J. Watkins, "Learning from delayed rewards," Ph.D. dissertation, 
University of Cambridge, Psychology Department., 1989. 

[27] C. J. Watkins and P. Dayan, "Technical note Q-learning," Machine 
Learning, vol. 8, pp. 279-292, Jan. 1992. 

[28] A. Atamtürk and M. W. P. Savelsbergh, "Integer-programming software
systems," Annals of Operations Research, vol. 140, no. 1, pp. 67-124,
Nov. 2005.

[29] K. G. Murty, Operations research : deterministic optimization models. 
Upper Saddle River, NJ: Prentice Hall, 1995. 

[30] M. Berkelaar, K. Eikland, and P. Notebaert, "Open source (mixed- 
integer) linear programming system: lpsolve v. 5.0.0.0," [Available: 
http://lpsolve.sourceforge.net], 2004. 

[31] A. Schwartz, "A reinforcement learning method for maximizing undis- 
counted rewards," in Proceedings of the Tenth International Conference 
in Machine Learning, San Mateo, CA, USA, 1993. 

[32] M. L. Puterman, Markov Decision Processes: Discrete Stochastic
Dynamic Programming. USA: Wiley-Interscience, 2005.



